{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {},
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W1D4_MachineLearning/student/W1D4_Tutorial2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a> &nbsp; <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/student/W4_TutorialBonus5.ipynb\" target=\"_parent\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" alt=\"Open in Kaggle\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Model Fitting: Classifiers and regularizers\n",
    "\n",
    "__Original content creators:__ Pierre-Etienne H. Fiquet, Ari Benjamin, Jakob Macke\n",
    "\n",
    "__Content Modified:__ Kai Chen\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "In Tutorial 2, we learned about and implemented GLMs. In this bonus tutorial, we’ll implement logistic regression, a special case of GLMs used to model binary outcomes.\n",
    "Oftentimes the variable you would like to predict takes only one of two possible values. Left or right? Awake or asleep? Car or bus? In this tutorial, we will decode a mouse's left/right decisions from spike train data. Our objectives are to:\n",
    "1.\tLearn about logistic regression, how it is derived within the GLM theory, and how it is implemented in scikit-learn\n",
    "2.\tApply logistic regression to decode choies from neural responses\n",
    "3.\tLearn about regularization, including the different approaches and the influence of hyperparameters\n",
    "\n",
    "---\n",
    "We would like to acknowledge [Steinmetz _et al._ (2019)](https://www.nature.com/articles/s41586-019-1787-x) for sharing their data, a subset of which is used here.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title Figure settings\n",
    "import ipywidgets as widgets\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "nma_style = {\n",
    "    'figure.figsize' : (8, 6),\n",
    "    'figure.autolayout' : True,\n",
    "    'font.size' : 15,\n",
    "    'xtick.labelsize' : 'small',\n",
    "    'ytick.labelsize' : 'small',\n",
    "    'legend.fontsize' : 'small',\n",
    "    'axes.spines.top' : False,\n",
    "    'axes.spines.right' : False,\n",
    "    'xtick.major.size' : 5,\n",
    "    'ytick.major.size' : 5,\n",
    "}\n",
    "for key, value in nma_style.items():\n",
    "    plt.rcParams[key] = value\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title Helper functions\n",
    "\n",
    "def plot_weights(models, sharey=True):\n",
    "  \"\"\"Draw a stem plot of weights for each model in models dict.\"\"\"\n",
    "  n = len(models)\n",
    "  f = plt.figure(figsize=(10, 2.5 * n))\n",
    "  axs = f.subplots(n, sharex=True, sharey=sharey)\n",
    "  axs = np.atleast_1d(axs)\n",
    "\n",
    "  for ax, (title, model) in zip(axs, models.items()):\n",
    "\n",
    "    ax.margins(x=.02)\n",
    "    stem = ax.stem(model.coef_.squeeze(), use_line_collection=True)\n",
    "    stem[0].set_marker(\".\")\n",
    "    stem[0].set_color(\".2\")\n",
    "    stem[1].set_linewidths(.5)\n",
    "    stem[1].set_color(\".2\")\n",
    "    stem[2].set_visible(False)\n",
    "    ax.axhline(0, color=\"C3\", lw=3)\n",
    "    ax.set(ylabel=\"Weight\", title=title)\n",
    "  ax.set(xlabel=\"Neuron (a.k.a. feature)\")\n",
    "  f.tight_layout()\n",
    "\n",
    "\n",
    "def plot_function(f, name, var, points=(-10, 10)):\n",
    "    \"\"\"Evaluate f() on linear space between points and plot.\n",
    "\n",
    "    Args:\n",
    "      f (callable): function that maps scalar -> scalar\n",
    "      name (string): Function name for axis labels\n",
    "      var (string): Variable name for axis labels.\n",
    "      points (tuple): Args for np.linspace to create eval grid.\n",
    "    \"\"\"\n",
    "    x = np.linspace(*points)\n",
    "    ax = plt.figure().subplots()\n",
    "    ax.plot(x, f(x))\n",
    "    ax.set(\n",
    "      xlabel=f'${var}$',\n",
    "      ylabel=f'${name}({var})$'\n",
    "    )\n",
    "\n",
    "\n",
    "def plot_model_selection(C_values, accuracies):\n",
    "  \"\"\"Plot the accuracy curve over log-spaced C values.\"\"\"\n",
    "  ax = plt.figure().subplots()\n",
    "  ax.set_xscale(\"log\")\n",
    "  ax.plot(C_values, accuracies, marker=\"o\")\n",
    "  best_C = C_values[np.argmax(accuracies)]\n",
    "  ax.set(\n",
    "      xticks=C_values,\n",
    "      xlabel=\"$C$\",\n",
    "      ylabel=\"Cross-validated accuracy\",\n",
    "      title=f\"Best C: {best_C:1g} ({np.max(accuracies):.2%})\",\n",
    "  )\n",
    "\n",
    "def plot_non_zero_coefs(C_values, non_zero_l1, n_voxels):\n",
    "  \"\"\"Plot the accuracy curve over log-spaced C values.\"\"\"\n",
    "  ax = plt.figure().subplots()\n",
    "  ax.set_xscale(\"log\")\n",
    "  ax.plot(C_values, non_zero_l1, marker=\"o\")\n",
    "  ax.set(\n",
    "    xticks=C_values,\n",
    "    xlabel=\"$C$\",\n",
    "    ylabel=\"Number of non-zero coefficients\",\n",
    "  )\n",
    "  ax.axhline(n_voxels, color=\".1\", linestyle=\":\")\n",
    "  ax.annotate(\"Total\\n# Neurons\", (C_values[0], n_voxels * .98), va=\"top\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title Data retrieval and loading\n",
    "import os\n",
    "import requests\n",
    "import hashlib\n",
    "\n",
    "url = \"https://osf.io/r9gh8/download\"\n",
    "fname = \"W1D4_steinmetz_data.npz\"\n",
    "expected_md5 = \"d19716354fed0981267456b80db07ea8\"\n",
    "\n",
    "if not os.path.isfile(fname):\n",
    "  try:\n",
    "    r = requests.get(url)\n",
    "  except requests.ConnectionError:\n",
    "    print(\"!!! Failed to download data !!!\")\n",
    "  else:\n",
    "    if r.status_code != requests.codes.ok:\n",
    "      print(\"!!! Failed to download data !!!\")\n",
    "    elif hashlib.md5(r.content).hexdigest() != expected_md5:\n",
    "      print(\"!!! Data download appears corrupted !!!\")\n",
    "    else:\n",
    "      with open(fname, \"wb\") as fid:\n",
    "        fid.write(r.content)\n",
    "\n",
    "def load_steinmetz_data(data_fname=fname):\n",
    "\n",
    "  with np.load(data_fname) as dobj:\n",
    "    data = dict(**dobj)\n",
    "\n",
    "  return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "# Section 1: Logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Logistic Regression is a binary classification model. It is a GLM with a *logistic* link function and a *Bernoulli* (i.e. coinflip) noise model.\n",
    "\n",
    "Like in the last notebook, logistic regression invokes a standard procedure:\n",
    "\n",
    "1.   Define a *model* of how inputs relate to outputs.\n",
    "2.   Adjust the parameters to maximize (log) probability of your data given your model\n",
    "\n",
    "## Section 1.1: The logistic regression model\n",
    "\n",
    "The fundamental input/output equation of logistic regression is:\n",
    "\n",
    "\n",
    "$$ \\hat{y} \\equiv p(y=1|x,\\theta) = \\sigma(\\theta^Tx)$$\n",
    "\n",
    "Note that we interpret the output of logistic regression, $\\hat{y}$, as the **probability that y = 1** given inputs $x$ and parameters $\\theta$.\n",
    "\n",
    "Here $\\sigma()$ is a \"squashing\" function called the **sigmoid function** or **logistic function**. Its output is in the range $0 \\leq y \\leq 1$. It looks like this:\n",
    "\n",
    "$$\\sigma(z) = \\frac{1}{1 + \\textrm{exp}(-z)}$$\n",
    "\n",
    "Recall that $z = \\theta^T x$. The parameters decide whether $\\theta^T x$ will be very negative, in which case $\\sigma(\\theta^T x)\\approx 0$, or very positive, meaning  $\\sigma(\\theta^T x)\\approx 1$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Exercise 1: implement the sigmoid function\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def sigmoid(z):\n",
    "  \"\"\"Return the logistic transform of z.\"\"\"\n",
    "  ##############################################################################\n",
    "  # TODO for students: Fill in the missing code (...) and remove the error\n",
    "  raise NotImplementedError(\"Student excercise: implement the sigmoid function\")\n",
    "  ##############################################################################\n",
    "  return ...\n",
    "\n",
    "# Uncomment to test your sigmoid function\n",
    "# plot_function(sigmoid, \"\\sigma\", \"z\", (-10, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus5_Solution_86cdb043.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=1116.0 height=828.0 src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/static/W4_TutorialBonus5_Solution_86cdb043_0.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 1.2: Using scikit-learn\n",
    "\n",
    "Unlike the previous notebook, we're not going to write the code that implements all of the Logistic Regression model itself. Instead, we're going to use the implementation in [scikit-learn](https://scikit-learn.org/stable/), a very popular library for Machine Learning.\n",
    "\n",
    "The goal of this next section is to introduce `scikit-learn` classifiers and understand how to apply it to real neural data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "#Section 2: Decoding neural data with logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 2.1: Setting up the data\n",
    "\n",
    "In this notebook we'll use the Steinmetz dataset that you have seen previously. Recall that this dataset includes recordings of neurons as mice perform a decision task. \n",
    "\n",
    "Mice had the task of turning a wheel to indicate whether they perceived a Gabor stimulus to the left, to the right, or not at all. Neuropixel probes measured spikes across the cortex. Check out the following task schematic from the BiorXiv preprint:\n",
    "\n",
    "<img src='http://kordinglab.com/images/others/steinmetz-task.png' width= '200'/>\n",
    "\n",
    "Today we're going to **decode the decision from neural data** using Logistic Regression. We will only consider trials where the mouse chose \"Left\" or \"Right\" and ignore NoGo trials.\n",
    "\n",
    "### Data format\n",
    "\n",
    "In the hidden `Data retrieval and loading` cell, there is a function that loads the data:\n",
    "\n",
    "- `spikes`: an array of normalized spike rates with shape `(n_trials, n_neurons)`\n",
    "- `choices`: a vector of 0s and 1s, indicating the animal's behavioral response, with length `n_trials`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "data = load_steinmetz_data()\n",
    "for key, val in data.items():\n",
    "  print(key, val.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "As with the GLMs you've seen in the previous tutorial (Linear and Poisson Regression), we will need two data structures:\n",
    "\n",
    "- an `X` matrix with shape `(n_samples, n_features)`\n",
    "- a `y` vector with length `n_samples`.\n",
    "\n",
    "In the previous notebook, `y` corresponded to the neural data, and `X` corresponded to something about the experiment. Here, we are going to invert those relationships. That's what makes this a *decoding* model: we are going to predict behavior (`y`) from the neural responses (`X`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "y = data[\"choices\"]\n",
    "X = data[\"spikes\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 2.2: Fitting the model\n",
    "\n",
    "Using a Logistic Regression model within `scikit-learn` is very simple. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "# First define the model\n",
    "log_reg = LogisticRegression(penalty=\"none\")\n",
    "\n",
    "#Then fit it to data\n",
    "log_reg.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "There's two steps here:\n",
    "\n",
    "- We *initialized* the model with a hyperparameter, telling it what penalty to use (we'll focus on this in the second part of the notebook)\n",
    "- We *fit* the model by passing it the `X` and `y` objects.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 2.3: Classifying the training data\n",
    "\n",
    "Fitting the model performs maximum likelihood optimization, learning a set of *feature weights*. We can use those learned weights to *classify* new data, or predict the labels for each sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "y_pred = log_reg.predict(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 2.4: Evaluating the model\n",
    "\n",
    "Now we need to evaluate the model's predictions. We'll do that with an *accuracy* score. The accuracy of the classifier is the proportion of trials where the predicted label matches the true label.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Excercise 2: classifier accuracy\n",
    "\n",
    "For the first execise, implement a function to evaluate a classifier using the accuracy score. Use it to get the accuracy of the classifer on the *training* data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def compute_accuracy(X, y, model):\n",
    "  \"\"\"Compute accuracy of classifier predictions.\n",
    "\n",
    "  Args:\n",
    "    X (2D array): Data matrix\n",
    "    y (1D array): Label vector\n",
    "    model (sklearn estimator): Classifier with trained weights.\n",
    "\n",
    "  Returns:\n",
    "    accuracy (float): Proportion of correct predictions.\n",
    "  \"\"\"\n",
    "  #############################################################################\n",
    "  # TODO Complete the function, then remove the next line to test it\n",
    "  raise NotImplementedError(\"Implement the compute_accuracy function\")\n",
    "  #############################################################################\n",
    "\n",
    "  y_pred = model.predict(X)\n",
    "  accuracy = ...\n",
    "\n",
    "  return accuracy\n",
    "\n",
    "# Uncomment and run to test your function:\n",
    "# train_accuracy = compute_accuracy(X, y, log_reg)\n",
    "# print(f\"Accuracy on the training data: {train_accuracy:.2%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus5_Solution_9fed1228.py)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 2.5: Cross-validating the classifer\n",
    "\n",
    "Classification accuracy on the training data is 100%! That might sound impressive, but you should recall from yesterday the concept of *overfitting*: the classifier may have learned something idiosyncratic about the training data. If that's the case, it won't have really learned the underlying data->decision function, and thus won't generalize well to new data.\n",
    "\n",
    "To check this, we can evaluate the *cross-validated* accuracy.\n",
    "\n",
    "\n",
    "<img src='http://kordinglab.com/images/others/justCV-01.png' width= '700'/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Cross-validating using `scikit-learn` helper functions\n",
    "\n",
    "Yesterday, we asked you to write your own functions for implementing cross-validation. In practice, this won't be necessary, because `scikit-learn` offers a number of [helpful functions](https://scikit-learn.org/stable/model_selection.html) that will do this for you. For example, you can cross-validate a classifer using `cross_val_score`.\n",
    "\n",
    "`cross_val_score` takes a `sklearn` model like `LogisticRegression`, as well as your `X` and `y` data. It then retrains your model on test/train splits of `X` and `y`, and returns the test accuracy on each of the test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "accuracies = cross_val_score(LogisticRegression(penalty='none'), X, y, cv=8) # k=8 crossvalidation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Run to plot out these `k=8` accuracy scores.\n",
    "f, ax = plt.subplots(figsize=(8, 3))\n",
    "ax.boxplot(accuracies, vert=False, widths=.7)\n",
    "ax.scatter(accuracies, np.ones(8))\n",
    "ax.set(\n",
    "  xlabel=\"Accuracy\",\n",
    "  yticks=[],\n",
    "  title=f\"Average test accuracy: {accuracies.mean():.2%}\"\n",
    ")\n",
    "ax.spines[\"left\"].set_visible(False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "The lower cross-validated accuracy compared to the training accuracy (100%) suggests that the model is being *overfit*. Is this surprising? Think about the shape of the $X$ matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "The model has almost three times as many features as samples. This is a situation where overfitting is very likely (almost guaranteed).\n",
    "\n",
    "**Link to neuroscience**: Neuro data commonly has more features than samples. Having more neurons than independent trials is one example. In fMRI data, there are commonly more measured voxels than independent trials.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "\n",
    "### Why more features than samples leads to overfitting\n",
    "\n",
    "In brief, the variance of model estimation increases when there are more features than samples. That is, you would get a very different model every time you get new data and run `.fit()`. This is very related to the *bias/variance tradeoff* you learned about on day 1. \n",
    "\n",
    "Why does this happen? Here's a tiny example to get your intuition going. Imagine trying to find a best-fit line in 2D when you only have 1 datapoint. There are simply a infinite number of lines that pass through that point. This is the situation we find ourselves in with more features than samples.\n",
    "\n",
    "### What we can do about it\n",
    "As you learned on day 1, you can decrease model variance if you don't mind increasing its bias. Here, we will increase bias by assuming that the correct parameters are all small. In our 2D example, this is like prefering the horizontal line to all others. This is one example of *regularization*. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "-----\n",
    "\n",
    "# Section 3: Regularization\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Regularization forces a model to learn a set solutions you *a priori* believe to be more correct, which reduces overfitting because it doesn't have as much flexibility to fit idiosyncracies in the training data. This adds model bias, but it's a good bias because you know (maybe) that parameters should be small or mostly 0.\n",
    "\n",
    "In a GLM, a common form of regularization is to *shrinking* the classifier weights. In a linear model, you can see its effect by plotting the weights. We've defined a helper function, `plot_weights`, that we'll use extensively in this section.\n",
    "\n",
    "Here is what the weights look like for a Logistic Regression model with no regularization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "log_reg = LogisticRegression(penalty=\"none\").fit(X, y)\n",
    "plot_weights({\"No regularization\": log_reg})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "It's important to understand this plot. Each dot visualizes each a value in our parameter vector $\\theta$. (It's the same style of plot as the one showing $\\theta$ in the video). Since each feature is the time-averaged response of a neuron, each dot shows how the model uses each neuron to estimate a decision.\n",
    "\n",
    "Note the scale of the y-axis. Some neurons have values of about $20$, whereas others scale to $-20$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 3.1: $L_2$ regularization\n",
    "\n",
    "Regularization comes in different flavors. A very common one uses an $L_2$ or \"ridge\" penalty. This changes the objective function to\n",
    "\n",
    "$$-\\log\\mathcal{L}'(\\theta | X, y)=\n",
    "-\\log\\mathcal{L}(\\theta | X, y) +\\frac\\beta2\\sum_i\\theta_i^2,\n",
    "$$\n",
    "\n",
    "where $\\beta$ is a *hyperparameter* that sets the *strength* of the regularization.\n",
    "\n",
    "You can use regularization in `scikit-learn` by changing the `penalty`, and you can set the strength of the regularization with the `C` hyperparameter ($C = \\frac{1}{\\beta}$, so this sets the *inverse* regularization).\n",
    "\n",
    "Let's compare the unregularized classifier weights with the classifier weights when we use the default `C = 1`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "log_reg_l2 = LogisticRegression(penalty=\"l2\", C=1).fit(X, y)\n",
    "\n",
    "# now show the two models\n",
    "models = {\n",
    "  \"No regularization\": log_reg,\n",
    "  \"$L_2$ (C = 1)\": log_reg_l2,\n",
    "}\n",
    "plot_weights(models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Using the same scale for the two y axes, it's almost impossible to see the $L_2$ weights. Let's allow the y axis scales to adjust to each set of weights:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "plot_weights(models, sharey=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "\n",
    "Now you can see that the weights have the same basic pattern, but the regularized weights are an order-of-magnitude smaller.\n",
    "\n",
    "### Interactive Demo: The effect of varying C on parameter size\n",
    "\n",
    "We can use this same approach to see how the weights depend on the *strength* of the regularization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Execute this cell to enable the widget!\n",
    "\n",
    "# Precompute the models so the widget is responsive\n",
    "log_C_steps = 1, 10, 1\n",
    "penalized_models = {}\n",
    "for log_C in np.arange(*log_C_steps, dtype=int):\n",
    "  m = LogisticRegression(\"l2\", C=10 ** log_C, max_iter=5000)\n",
    "  penalized_models[log_C] = m.fit(X, y)\n",
    "\n",
    "@widgets.interact\n",
    "def plot_observed(log_C=log_C_steps):\n",
    "  models = {\n",
    "    \"No regularization\": log_reg,\n",
    "    f\"$L_2$ (C = $10^{log_C}$)\": penalized_models[log_C]\n",
    "  }\n",
    "  plot_weights(models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Recall from above that $C=\\frac1\\beta$ so larger `C` is less regularization. The top panel corresponds to $C=\\infty$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Section 3.2: $L_1$ regularization\n",
    "\n",
    "$L_2$ is not the only option for regularization. There is also the $L_1$, or \"Lasso\" penalty. This changes the objective function to\n",
    "\n",
    "$$\n",
    "-\\log\\mathcal{L}'(\\theta | X, y)=\n",
    "-\\log\\mathcal{L}(\\theta | X, y) +\\frac\\beta2\\sum_i|\\theta_i|\n",
    "$$\n",
    "\n",
    "In practice, using the summed absolute values of the weights causes *sparsity*: instead of just getting smaller, some of the weights will get forced to $0$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "log_reg_l1 = LogisticRegression(penalty=\"l1\", C=1, solver=\"saga\", max_iter=5000)\n",
    "log_reg_l1.fit(X, y)\n",
    "models = {\n",
    "  \"$L_2$ (C = 1)\": log_reg_l2,\n",
    "  \"$L_1$ (C = 1)\": log_reg_l1,\n",
    "}\n",
    "plot_weights(models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Note: You'll notice that we added two additional parameters: `solver=\"saga\"` and `max_iter=5000`. The `LogisticRegression` class can use several different optimization algorithms (\"solvers\"), and not all of them support the $L_1$ penalty. At a certain point, the solver will give up if it hasn't found a minimum value. The `max_iter` parameter tells it to make more attempts; otherwise, we'd see an ugly warning about \"convergence\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Section 4: The key difference between $L_1$ and $L_2$ regularization: sparsity\n",
    "\n",
    "When should you use $L_1$ vs. $L_2$ regularization? Both penalties shrink parameters, and both will help reduce overfitting. However, the models they lead to are different. \n",
    "\n",
    "In particular, the $L_1$ penalty encourages *sparse* solutions in which most parameters are 0. Let's unpack the notion of sparsity.\n",
    "\n",
    "A \"dense\" vector has mostly nonzero elements:\n",
    "$\\begin{bmatrix}\n",
    "  0.1 \\\\ -0.6\\\\-9.1\\\\0.07 \n",
    "\\end{bmatrix}$. \n",
    "A \"sparse\" vector has mostly zero elements:\n",
    "$\\begin{bmatrix}\n",
    "  0 \\\\ -0.7\\\\ 0\\\\0\n",
    "\\end{bmatrix}$.\n",
    "\n",
    "The same is true of matrices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute to plot a dense and a sparse matrix\n",
    "np.random.seed(50)\n",
    "n = 5\n",
    "M = np.random.random((n, n))\n",
    "M_sparse = np.random.choice([0,1], size=(n, n), p=[0.8, 0.2])\n",
    "\n",
    "fig, axs = plt.subplots(1, 2, sharey=True, figsize=(10,5))\n",
    "\n",
    "axs[0].imshow(M)\n",
    "axs[1].imshow(M_sparse)\n",
    "axs[0].axis('off')\n",
    "axs[1].axis('off')\n",
    "axs[0].set_title(\"A dense matrix\", fontsize=15)\n",
    "axs[1].set_title(\"A sparse matrix\", fontsize=15)\n",
    "text_kws = dict(ha=\"center\", va=\"center\")\n",
    "for i in range(n):\n",
    "  for j in range(n):\n",
    "    iter_parts = axs, [M, M_sparse], [\"{:.1f}\", \"{:d}\"]\n",
    "    for ax, mat, fmt in zip(*iter_parts):\n",
    "      val = mat[i, j]\n",
    "      color = \".1\" if val > .7 else \"w\"\n",
    "      ax.text(j, i, fmt.format(val), c=color, **text_kws)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "## Exercise 3: The effect of $L_1$ regularization on parameter sparsity\n",
    "\n",
    "Please complete the following function to fit a regularized `LogisticRegression` model and return **the number of coefficients in the parameter vector that are equal to 0**.\n",
    "\n",
    "Don't forget to check out the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def count_non_zero_coefs(X, y, C_values):\n",
    "  \"\"\"Fit models with different L1 penalty values and count non-zero coefficients.\n",
    "\n",
    "  Args:\n",
    "    X (2D array): Data matrix\n",
    "    y (1D array): Label vector\n",
    "    C_values (1D array): List of hyperparameter values\n",
    "\n",
    "  Returns:\n",
    "    non_zero_coefs (list): number of coefficients in each model that are nonzero\n",
    "\n",
    "  \"\"\"\n",
    "  #############################################################################\n",
    "  # TODO Complete the function and remove the error\n",
    "  raise NotImplementedError(\"Implement the count_non_zero_coefs function\")\n",
    "  #############################################################################\n",
    "\n",
    "  non_zero_coefs = []\n",
    "  for C in C_values:\n",
    "\n",
    "    # Initialize and fit the model\n",
    "    # (Hint, you may need to set max_iter)\n",
    "    model = ...\n",
    "    ...\n",
    "\n",
    "    # Get the coefs of the fit model\n",
    "    coefs = ...\n",
    "\n",
    "    # Count the number of non-zero elements in coefs\n",
    "    non_zero = ...\n",
    "    non_zero_coefs.append(non_zero)\n",
    "\n",
    "  return non_zero_coefs\n",
    "\n",
    "# Use log-spaced values for C\n",
    "C_values = np.logspace(-4, 4, 5)\n",
    "\n",
    "# Uncomment and run when the function is ready for testing\n",
    "# non_zero_l1 = count_non_zero_coefs(X, y, C_values)\n",
    "# plot_non_zero_coefs(C_values, non_zero_l1, n_voxels=X.shape[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus5_Solution_c3192636.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=1116.0 height=827.0 src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/static/W4_TutorialBonus5_Solution_c3192636_0.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "Smaller `C` (bigger $\\beta$) leads to sparser solutions.\n",
    "\n",
    "**Link to neuroscience**: When is it OK to assume that the parameter vector is sparse? Whenever it is true that most features don't affect the outcome. One use-case might be decoding low-level visual features from whole-brain fMRI: we may expect only voxels in V1 and thalamus should be used in the prediction.\n",
    "\n",
    "**WARNING**: be careful when interpreting $\\theta$. Never interpret the nonzero coefficients as *evidence* that only those voxels/neurons/features carry information about the outcome. This is a product of our regularization scheme, and thus *our prior assumption that the solution is sparse*. Other regularization types or models may find very distributed relationships across the brain. Never use a model as evidence for a phenomena when that phenomena is encoded in the assumptions of the model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "---\n",
    "\n",
    "## Section 4.1: Choosing the regularization penalty\n",
    "\n",
    "In the examples above, we just picked arbitrary numbers for the strength of regularization. How do you know what value of the hyperparameter to use?\n",
    "\n",
    "The answer is the same as when you want to know whether you have learned good parameter values: use cross-validation. The best hyperparameter will be the one that allows the model to generalize best to unseen data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "### Exercise 4: Model selection\n",
    "\n",
    "In the final exercise, we will use cross-validation to evaluate a set of models, each with a different $L_2$ penalty. Your `model_selection` function should have a for-loop that gets the mean cross-validated accuracy for each penalty value (use the `cross_val_score` function that we introduced above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab_type": "code",
    "execution": {}
   },
   "outputs": [],
   "source": [
    "def model_selection(X, y, C_values):\n",
    "  \"\"\"Compute CV accuracy for each C value.\n",
    "\n",
    "  Args:\n",
    "    X (2D array): Data matrix\n",
    "    y (1D array): Label vector\n",
    "    C_values (1D array): Array of hyperparameter values\n",
    "\n",
    "  Returns:\n",
    "    accuracies (1D array): CV accuracy with each value of C\n",
    "\n",
    "  \"\"\"\n",
    "  #############################################################################\n",
    "  # TODO Complete the function and remove the error\n",
    "  raise NotImplementedError(\"Implement the model_selection function\")\n",
    "  #############################################################################\n",
    "\n",
    "  accuracies = []\n",
    "  for C in C_values:\n",
    "\n",
    "    # Initialize and fit the model\n",
    "    # (Hint, you may need to set max_iter)\n",
    "    model = ...\n",
    "\n",
    "    # Get the accuracy for each test split\n",
    "    accs = ...\n",
    "\n",
    "    # Store the average test accuracy for this value of C\n",
    "    accuracies.append(...)\n",
    "\n",
    "  return accuracies\n",
    "\n",
    "# Use log-spaced values for C\n",
    "C_values = np.logspace(-4, 4, 9)\n",
    "\n",
    "# Uncomment and run when the function is ready to test\n",
    "# accuracies = model_selection(X, y, C_values)\n",
    "# plot_model_selection(C_values, accuracies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "[*Click for solution*](https://github.com/NeoNeuron/professional-workshop-3/tree/master//tutorials/W4_ModelFitting/solutions/W4_TutorialBonus5_Solution_988f1713.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=1116.0 height=827.0 src=https://raw.githubusercontent.com/NeoNeuron/professional-workshop-3/master/tutorials/W4_ModelFitting/static/W4_TutorialBonus5_Solution_988f1713_0.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "This plot suggests that the right value of $C$ does matter — up to a point. Remember that C is the *inverse* regularization. The plot shows that models where the regularization was too strong (small C values) performed very poorly. For $C > 10^{-2}$, the differences are marginal, but the best performance was obtained with an intermediate value ($C \\approx 10^1$)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Summary\n",
    "\n",
    "In this notebook, we learned about Logistic Regression, a fundamental algorithm for *classification*. We applied the algorithm to a *neural decoding* problem: we tried to predict an animal's behavioral choice from its neural activity. We saw again how important it is to use *cross-validation* to evaluate complex models that are at risk for *overfitting*, and we learned how *regularization* can be used to fit models that generalize better. Finally, we learned about some of the different options for regularization, and we saw how cross-validation can be useful for *model selection*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Appendix: The Logistic Regression model in full\n",
    "\n",
    "The fundamental input/output equation of logistic regression is:\n",
    "\n",
    "$$p(y_i = 1 |x_i, \\theta) = \\sigma(\\theta^Tx_i)$$\n",
    "\n",
    "## The logistic link function\n",
    "\n",
    "You've seen $\\theta^T x_i$ before, but the $\\sigma$ is new. It's the *sigmoidal* or *logistic* link function that \"squashes\" $\\theta^T x_i$ to keep it between $0$ and $1$:\n",
    "\n",
    "$$\\sigma(z) = \\frac{1}{1 + \\textrm{exp}(-z)}$$\n",
    "\n",
    "## The Bernoulli likelihood\n",
    "\n",
    "You might have noticed that the output of the sigmoid, $\\hat{y}$ is not a binary value (0 or 1), even though the true data $y$ is! Instead, we interpret the value of $\\hat{y}$ as the *probability that y = 1*:\n",
    "\n",
    "$$ \\hat{y_i} \\equiv p(y_i=1|x_i,\\theta) = \\frac{1}{{1 + \\textrm{exp}(-\\theta^Tx_i)}}$$\n",
    "\n",
    "To get the likelihood of the parameters, we need to define *the probability of seeing $y$ given $\\hat{y}$*. In logistic regression, we do this using the Bernoulli distribution:\n",
    "\n",
    "$$P(y_i\\ |\\ \\hat{y}_i) = \\hat{y}_i^{y_i}(1 - \\hat{y}_i)^{(1 - y_i)}$$\n",
    "\n",
    "So plugging in the regression model:\n",
    "\n",
    "$$P(y_i\\ |\\ \\theta, x_i) = \\sigma(\\theta^Tx_i)^{y_i}(1 - \\sigma(\\theta^Tx_i))^{(1 - y_i)}.$$\n",
    "\n",
    "This expression effectively measures how good our parameters $\\theta$ are. We can also write it as the likelihood of the parameters given the data:\n",
    "\n",
    "$$\\mathcal{L}(\\theta\\ |\\ y_i, x_i) = P(y_i\\ |\\ \\theta, x_i),$$\n",
    "\n",
    "and then use this as a target of optimization, considering all of the trials independently:\n",
    "\n",
    "$$\\textrm{log}\\mathcal{L}(\\theta | X, y) = \\sum_{i=1}^Ny_i\\textrm{log}(\\sigma(\\theta^Tx_i))\\ +\\ (1-y_i)\\textrm{log}(1 - \\sigma(\\theta^Tx_i)).$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "execution": {}
   },
   "source": [
    "# Appendix: More detail about model selection\n",
    "\n",
    "In the final exercise, we used all of the data to choose the hyperparameters. That means we don't have any fresh data left over to evaluate the performance of the selected model. In practice, you would want to have two *nested* layers of cross-validation, where the final evaluation is performed on data that played no role in selecting or training the model.\n",
    "\n",
    "Indeed, the proper method for splitting your data to choose hyperparameters can get confusing. Here's a guide that the authors of this notebook developed while writing a tutorial on using machine learning for neural decoding (https://arxiv.org/abs/1708.00909).\n",
    "\n",
    "<img src='http://kordinglab.com/images/others/CV-01.png' width= '700'/>\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W4_TutorialBonus5",
   "provenance": [],
   "toc_visible": true
  },
  "interpreter": {
   "hash": "9516f62da91337f10c2adbe814d9c63a4b08f8271333386358218606edb781e3"
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3.7.11 64-bit ('pw3': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
